Pesquisa | BVS - MINISTÉRIO DA SAÚDE

1.

Centralized and Federated Models for the Analysis of Clinical Data.

Li, Ruowang; Romano, Joseph D; Chen, Yong; Moore, Jason H.

Annu Rev Biomed Data Sci ; 2024 May 09.

Artigo em Inglês | MEDLINE | ID: mdl-38723657

RESUMO

The progress of precision medicine research hinges on the gathering and analysis of extensive and diverse clinical datasets. With the continued expansion of modalities, scales, and sources of clinical datasets, it becomes imperative to devise methods for aggregating information from these varied sources to achieve a comprehensive understanding of diseases. In this review, we describe two important approaches for the analysis of diverse clinical datasets, namely the centralized model and federated model. We compare and contrast the strengths and weaknesses inherent in each model and present recent progress in methodologies and their associated challenges. Finally, we present an outlook on the opportunities that both models hold for the future analysis of clinical data.

2.

The Alzheimer's Knowledge Base: A Knowledge Graph for Alzheimer Disease Research.

Romano, Joseph D; Truong, Van; Kumar, Rachit; Venkatesan, Mythreye; Graham, Britney E; Hao, Yun; Matsumoto, Nick; Li, Xi; Wang, Zhiping; Ritchie, Marylyn D; Shen, Li; Moore, Jason H.

J Med Internet Res ; 26: e46777, 2024 Apr 18.

Artigo em Inglês | MEDLINE | ID: mdl-38635981

RESUMO

BACKGROUND: As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture relationships central to the disease's etiology and response to drugs. OBJECTIVE: We designed the Alzheimer's Knowledge Base (AlzKB) to alleviate this need by providing a comprehensive knowledge representation of AD etiology and candidate therapeutics. METHODS: We designed the AlzKB as a large, heterogeneous graph knowledge base assembled using 22 diverse external data sources describing biological and pharmaceutical entities at different levels of organization (eg, chemicals, genes, anatomy, and diseases). AlzKB uses a Web Ontology Language 2 ontology to enforce semantic consistency and allow for ontological inference. We provide a public version of AlzKB and allow users to run and modify local versions of the knowledge base. RESULTS: AlzKB is freely available on the web and currently contains 118,902 entities with 1,309,527 relationships between those entities. To demonstrate its value, we used graph data science and machine learning to (1) propose new therapeutic targets based on similarities of AD to Parkinson disease and (2) repurpose existing drugs that may treat AD. For each use case, AlzKB recovers known therapeutic associations while proposing biologically plausible new ones. CONCLUSIONS: AlzKB is a new, publicly available knowledge resource that enables researchers to discover complex translational associations for AD drug discovery. Through 2 use cases, we show that it is a valuable tool for proposing novel therapeutic hypotheses based on public biomedical knowledge.

Assuntos

Doença de Alzheimer , Humanos , Doença de Alzheimer/tratamento farmacológico , Doença de Alzheimer/genética , Reconhecimento Automatizado de Padrão , Bases de Conhecimento , Aprendizado de Máquina , Conhecimento

3.

AI-luminating Artificial Intelligence in Inflammatory Bowel Diseases: A Narrative Review on the Role of AI in Endoscopy, Histology, and Imaging for IBD.

Gu, Phillip; Mendonca, Oreen; Carter, Dan; Dube, Shishir; Wang, Paul; Huang, Xiuzhen; Li, Debiao; Moore, Jason H; McGovern, Dermot P B.

Inflamm Bowel Dis ; 2024 Mar 07.

Artigo em Inglês | MEDLINE | ID: mdl-38452040

RESUMO

Endoscopy, histology, and cross-sectional imaging serve as fundamental pillars in the detection, monitoring, and prognostication of inflammatory bowel disease (IBD). However, interpretation of these studies often relies on subjective human judgment, which can lead to delays, intra- and interobserver variability, and potential diagnostic discrepancies. With the rising incidence of IBD globally coupled with the exponential digitization of these data, there is a growing demand for innovative approaches to streamline diagnosis and elevate clinical decision-making. In this context, artificial intelligence (AI) technologies emerge as a timely solution to address the evolving challenges in IBD. Early studies using deep learning and radiomics approaches for endoscopy, histology, and imaging in IBD have demonstrated promising results for using AI to detect, diagnose, characterize, phenotype, and prognosticate IBD. Nonetheless, the available literature has inherent limitations and knowledge gaps that need to be addressed before AI can transition into a mainstream clinical tool for IBD. To better understand the potential value of integrating AI in IBD, we review the available literature to summarize our current understanding and identify gaps in knowledge to inform future investigations.

4.

Artificial intelligence and technology collaboratories: Empowering innovation in AI + AgeTech.

Li, Rose M; Abadir, Peter M; Battle, Alexis; Chellappa, Rama; Choudhry, Niteesh K; Demiris, George; Ganesan, Deepak; Karlawish, Jason; Moore, Jason H; Walston, Jeremy D.

J Am Geriatr Soc ; 2024 Feb 26.

Artigo em Inglês | MEDLINE | ID: mdl-38407353

5.

Artificial Intelligence and Technology Collaboratories: Innovating aging research and Alzheimer's care.

Abadir, Peter; Oh, Esther; Chellappa, Rama; Choudhry, Niteesh; Demiris, George; Ganesan, Deepak; Karlawish, Jason; Marlin, Benjamin; Li, Rose M; Dehak, Najim; Arbaje, Alicia; Unberath, Mathias; Cudjoe, Thomas; Chute, Christopher; Moore, Jason H; Phan, Phillip; Samus, Quincy; Schoenborn, Nancy L; Battle, Alexis; Walston, Jeremy D.

Alzheimers Dement ; 20(4): 3074-3079, 2024 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-38324244

RESUMO

This perspective outlines the Artificial Intelligence and Technology Collaboratories (AITC) at Johns Hopkins University, University of Pennsylvania, and University of Massachusetts, highlighting their roles in developing AI-based technologies for older adult care, particularly targeting Alzheimer's disease (AD). These National Institute on Aging (NIA) centers foster collaboration among clinicians, gerontologists, ethicists, business professionals, and engineers to create AI solutions. Key activities include identifying technology needs, stakeholder engagement, training, mentoring, data integration, and navigating ethical challenges. The objective is to apply these innovations effectively in real-world scenarios, including in rural settings. In addition, the AITC focuses on developing best practices for AI application in the care of older adults, facilitating pilot studies, and addressing ethical concerns related to technology development for older adults with cognitive impairment, with the ultimate aim of improving the lives of older adults and their caregivers. HIGHLIGHTS: Addressing the complex needs of older adults with Alzheimer's disease (AD) requires a comprehensive approach, integrating medical and social support. Current gaps in training, techniques, tools, and expertise hinder uniform access across communities and health care settings. Artificial intelligence (AI) and digital technologies hold promise in transforming care for this demographic. Yet, transitioning these innovations from concept to marketable products presents significant challenges, often stalling promising advancements in the developmental phase. The Artificial Intelligence and Technology Collaboratories (AITC) program, funded by the National Institute on Aging (NIA), presents a viable model. These Collaboratories foster the development and implementation of AI methods and technologies through projects aimed at improving care for older Americans, particularly those with AD, and promote the sharing of best practices in AI and technology integration. Why Does This Matter? The National Institute on Aging (NIA) Artificial Intelligence and Technology Collaboratories (AITC) program's mission is to accelerate the adoption of artificial intelligence (AI) and new technologies for the betterment of older adults, especially those with dementia. By bridging scientific and technological expertise, fostering clinical and industry partnerships, and enhancing the sharing of best practices, this program can significantly improve the health and quality of life for older adults with Alzheimer's disease (AD).

Assuntos

Doença de Alzheimer , Isotiocianatos , Estados Unidos , Humanos , Idoso , Doença de Alzheimer/terapia , Inteligência Artificial , Gerociência , Qualidade de Vida , Tecnologia

6.

Interaction models matter: an efficient, flexible computational framework for model-specific investigation of epistasis.

Batista, Sandra; Madar, Vered Senderovich; Freda, Philip J; Bhandary, Priyanka; Ghosh, Attri; Matsumoto, Nicholas; Chitre, Apurva S; Palmer, Abraham A; Moore, Jason H.

BioData Min ; 17(1): 7, 2024 Feb 28.

Artigo em Inglês | MEDLINE | ID: mdl-38419006

RESUMO

PURPOSE: Epistasis, the interaction between two or more genes, is integral to the study of genetics and is present throughout nature. Yet, it is seldom fully explored as most approaches primarily focus on single-locus effects, partly because analyzing all pairwise and higher-order interactions requires significant computational resources. Furthermore, existing methods for epistasis detection only consider a Cartesian (multiplicative) model for interaction terms. This is likely limiting as epistatic interactions can evolve to produce varied relationships between genetic loci, some complex and not linearly separable. METHODS: We present new algorithms for the interaction coefficients for standard regression models for epistasis that permit many varied models for the interaction terms for loci and efficient memory usage. The algorithms are given for two-way and three-way epistasis and may be generalized to higher order epistasis. Statistical tests for the interaction coefficients are also provided. We also present an efficient matrix based algorithm for permutation testing for two-way epistasis. We offer a proof and experimental evidence that methods that look for epistasis only at loci that have main effects may not be justified. Given the computational efficiency of the algorithm, we applied the method to a rat data set and mouse data set, with at least 10,000 loci and 1,000 samples each, using the standard Cartesian model and the XOR model to explore body mass index. RESULTS: This study reveals that although many of the loci found to exhibit significant statistical epistasis overlap between models in rats, the pairs are mostly distinct. Further, the XOR model found greater evidence for statistical epistasis in many more pairs of loci in both data sets with almost all significant epistasis in mice identified using XOR. In the rat data set, loci involved in epistasis under the XOR model are enriched for biologically relevant pathways. CONCLUSION: Our results in both species show that many biologically relevant epistatic relationships would have been undetected if only one interaction model was applied, providing evidence that varied interaction models should be implemented to explore epistatic interactions that occur in living systems.

7.

The Molecular Twin artificial-intelligence platform integrates multi-omic data to predict outcomes for pancreatic adenocarcinoma patients.

Osipov, Arsen; Nikolic, Ognjen; Gertych, Arkadiusz; Parker, Sarah; Hendifar, Andrew; Singh, Pranav; Filippova, Darya; Dagliyan, Grant; Ferrone, Cristina R; Zheng, Lei; Moore, Jason H; Tourtellotte, Warren; Van Eyk, Jennifer E; Theodorescu, Dan.

Nat Cancer ; 5(2): 299-314, 2024 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-38253803

RESUMO

Contemporary analyses focused on a limited number of clinical and molecular biomarkers have been unable to accurately predict clinical outcomes in pancreatic ductal adenocarcinoma. Here we describe a precision medicine platform known as the Molecular Twin consisting of advanced machine-learning models and use it to analyze a dataset of 6,363 clinical and multi-omic molecular features from patients with resected pancreatic ductal adenocarcinoma to accurately predict disease survival (DS). We show that a full multi-omic model predicts DS with the highest accuracy and that plasma protein is the top single-omic predictor of DS. A parsimonious model learning only 589 multi-omic features demonstrated similar predictive performance as the full multi-omic model. Our platform enables discovery of parsimonious biomarker panels and performance assessment of outcome prediction models learning from resource-intensive panels. This approach has considerable potential to impact clinical care and democratize precision cancer medicine worldwide.

Assuntos

Adenocarcinoma , Carcinoma Ductal Pancreático , Neoplasias Pancreáticas , Humanos , Adenocarcinoma/genética , Adenocarcinoma/cirurgia , Neoplasias Pancreáticas/genética , Neoplasias Pancreáticas/cirurgia , Multiômica , Inteligência Artificial , Carcinoma Ductal Pancreático/genética , Carcinoma Ductal Pancreático/cirurgia , Inteligência

8.

Artificial intelligence: revolutionizing cardiology with large language models.

Boonstra, Machteld J; Weissenbacher, Davy; Moore, Jason H; Gonzalez-Hernandez, Graciela; Asselbergs, Folkert W.

Eur Heart J ; 45(5): 332-345, 2024 Feb 01.

Artigo em Inglês | MEDLINE | ID: mdl-38170821

RESUMO

Natural language processing techniques are having an increasing impact on clinical care from patient, clinician, administrator, and research perspective. Among others are automated generation of clinical notes and discharge letters, medical term coding for billing, medical chatbots both for patients and clinicians, data enrichment in the identification of disease symptoms or diagnosis, cohort selection for clinical trial, and auditing purposes. In the review, an overview of the history in natural language processing techniques developed with brief technical background is presented. Subsequently, the review will discuss implementation strategies of natural language processing tools, thereby specifically focusing on large language models, and conclude with future opportunities in the application of such techniques in the field of cardiology.

Assuntos

Inteligência Artificial , Cardiologia , Humanos , Processamento de Linguagem Natural , Alta do Paciente

9.

mixWAS: An efficient distributed algorithm for mixed-outcomes genome-wide association studies.

Li, Ruowang; Benz, Luke; Duan, Rui; Denny, Joshua C; Hakonarson, Hakon; Mosley, Jonathan D; Smoller, Jordan W; Wei, Wei-Qi; Ritchie, Marylyn D; Moore, Jason H; Chen, Yong.

medRxiv ; 2024 Jan 10.

Artigo em Inglês | MEDLINE | ID: mdl-38260403

RESUMO

Genome-wide association studies (GWAS) have been instrumental in identifying genetic associations for various diseases and traits. However, uncovering genetic underpinnings among traits beyond univariate phenotype associations remains a challenge. Multi-phenotype associations (MPA), or genetic pleiotropy, offer important insights into shared genes and pathways among traits, enhancing our understanding of genetic architectures of complex diseases. GWAS of biobank-linked electronic health record (EHR) data are increasingly being utilized to identify MPA among various traits and diseases. However, methodologies that can efficiently take advantage of distributed EHR to detect MPA are still lacking. Here, we introduce mixWAS, a novel algorithm that efficiently and losslessly integrates multiple EHRs via summary statistics, allowing the detection of MPA among mixed phenotypes while accounting for heterogeneities across EHRs. Simulations demonstrate that mixWAS outperforms the widely used MPA detection method, Phenome-wide association study (PheWAS), across diverse scenarios. Applying mixWAS to data from seven EHRs in the US, we identified 4,534 MPA among blood lipids, BMI, and circulatory diseases. Validation in an independent EHR data from UK confirmed 97.7% of the associations. mixWAS fundamentally improves the detection of MPA and is available as a free, open-source software.

10.

Risk prediction: Methods, Challenges, and Opportunities.

Li, Ruowang; Duan, Rui; He, Lifang; Moore, Jason H.

Pac Symp Biocomput ; 29: 650-653, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38160314

RESUMO

The following sections are included:Introduction to the workshopWorkshop Presenters.

11.

Cluster Analysis reveals Socioeconomic Disparities among Elective Spine Surgery Patients.

Orlenko, Alena; Freda, Philip J; Ghosh, Attri; Choi, Hyunjun; Matsumoto, Nicholas; Bright, Tiffani J; Walker, Corey T; Obafemi-Ajayi, Tayo; Moore, Jason H.

Pac Symp Biocomput ; 29: 359-373, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38160292

RESUMO

This work demonstrates the use of cluster analysis in detecting fair and unbiased novel discoveries. Given a sample population of elective spinal fusion patients, we identify two overarching subgroups driven by insurance type. The Medicare group, associated with lower socioeconomic status, exhibited an over-representation of negative risk factors. The findings provide a compelling depiction of the interwoven socioeconomic and racial disparities present within the healthcare system, highlighting their consequential effects on health inequalities. The results are intended to guide design of fair and precise machine learning models based on intentional integration of population stratification.

Assuntos

Medicare , Disparidades Socioeconômicas em Saúde , Idoso , Humanos , Estados Unidos , Biologia Computacional , Grupos Raciais , Análise por Conglomerados

12.

SynTwin: A graph-based approach for predicting clinical outcomes using digital twins derived from synthetic patients.

Moore, Jason H; Li, Xi; Chang, Jui-Hsuan; Tatonetti, Nicholas P; Theodorescu, Dan; Chen, Yong; Asselbergs, Folkert W; Venkatesan, Mythreye; Wang, Zhiping Paul.

Pac Symp Biocomput ; 29: 96-107, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38160272

RESUMO

The concept of a digital twin came from the engineering, industrial, and manufacturing domains to create virtual objects or machines that could inform the design and development of real objects. This idea is appealing for precision medicine where digital twins of patients could help inform healthcare decisions. We have developed a methodology for generating and using digital twins for clinical outcome prediction. We introduce a new approach that combines synthetic data and network science to create digital twins (i.e. SynTwin) for precision medicine. First, our approach starts by estimating the distance between all subjects based on their available features. Second, the distances are used to construct a network with subjects as nodes and edges defining distance less than the percolation threshold. Third, communities or cliques of subjects are defined. Fourth, a large population of synthetic patients are generated using a synthetic data generation algorithm that models the correlation structure of the data to generate new patients. Fifth, digital twins are selected from the synthetic patient population that are within a given distance defining a subject community in the network. Finally, we compare and contrast community-based prediction of clinical endpoints using real subjects, digital twins, or both within and outside of the community. Key to this approach are the digital twins defined using patient similarity that represent hypothetical unobserved patients with patterns similar to nearby real patients as defined by network distance and community structure. We apply our SynTwin approach to predicting mortality in a population-based cancer registry (n=87,674) from the Surveillance, Epidemiology, and End Results (SEER) program from the National Cancer Institute (USA). Our results demonstrate that nearest network neighbor prediction of mortality in this study is significantly improved with digital twins (AUROC=0.864, 95% CI=0.857-0.872) over just using real data alone (AUROC=0.791, 95% CI=0.781-0.800). These results suggest a network-based digital twin strategy using synthetic patients may add value to precision medicine efforts.

Assuntos

Algoritmos , Biologia Computacional , Humanos , Análise por Conglomerados , Medicina de Precisão

13.

Ten simple rules for managing laboratory information.

Berezin, Casey-Tyler; Aguilera, Luis U; Billerbeck, Sonja; Bourne, Philip E; Densmore, Douglas; Freemont, Paul; Gorochowski, Thomas E; Hernandez, Sarah I; Hillson, Nathan J; King, Connor R; Köpke, Michael; Ma, Shuyi; Miller, Katie M; Moon, Tae Seok; Moore, Jason H; Munsky, Brian; Myers, Chris J; Nicholas, Dequina A; Peccoud, Samuel J; Zhou, Wen; Peccoud, Jean.

PLoS Comput Biol ; 19(12): e1011652, 2023 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-38060459

RESUMO

Information is the cornerstone of research, from experimental (meta)data and computational processes to complex inventories of reagents and equipment. These 10 simple rules discuss best practices for leveraging laboratory information management systems to transform this large information load into useful scientific findings.

14.

Publisher Correction: Quantifying and correcting bias due to outcome dependent self-reported weights in longitudinal study of weight loss interventions.

Tong, Jiayi; Duan, Rui; Li, Ruowang; Luo, Chongliang; Moore, Jason H; Zhu, Jingsan; Foster, Gary D; Volpp, Kevin G; Yancy, William S; Shaw, Pamela A; Chen, Yong.

Sci Rep ; 13(1): 22546, 2023 Dec 18.

Artigo em Inglês | MEDLINE | ID: mdl-38110504

15.

Quantifying and correcting bias due to outcome dependent self-reported weights in longitudinal study of weight loss interventions.

Tong, Jiayi; Duan, Rui; Li, Ruowang; Luo, Chongliang; Moore, Jason H; Zhu, Jingsan; Foster, Gary D; Volpp, Kevin G; Yancy, William S; Shaw, Pamela A; Chen, Yong.

Sci Rep ; 13(1): 19078, 2023 11 04.

Artigo em Inglês | MEDLINE | ID: mdl-37925516

RESUMO

In response to the escalating global obesity crisis and its associated health and financial burdens, this paper presents a novel methodology for analyzing longitudinal weight loss data and assessing the effectiveness of financial incentives. Drawing from the Keep It Off trial-a three-arm randomized controlled study with 189 participants-we examined the potential impact of financial incentives on weight loss maintenance. Given that some participants choose not to weigh themselves because of small weight change or weight gains, which is a common phenomenon in many weight-loss studies, traditional methods, for example, the Generalized Estimating Equations (GEE) method tends to overestimate the effect size due to the assumption that data are missing completely at random. To address this challenge, we proposed a framework which can identify evidence of missing not at random and conduct bias correction using the estimating equation derived from pairwise composite likelihood. By analyzing the Keep It Off data, we found that the data in this trial are most likely characterized by non-random missingness. Notably, we also found that the enrollment time (i.e., duration time) would be positively associated with the weight loss maintenance after adjusting for the baseline participant characteristics (e.g., age, sex). Moreover, the lottery-based intervention was found to be more effective in weight loss maintenance compared with the direct payment intervention, though the difference was non-statistically significant. This framework's significance extends beyond weight loss research, offering a semi-parametric approach to assess missing data mechanisms and robustly explore associations between exposures (e.g., financial incentives) and key outcomes (e.g., weight loss maintenance). In essence, the proposed methodology provides a powerful toolkit for analyzing real-world longitudinal data, particularly in scenarios with data missing not at random, enriching comprehension of intricate dataset dynamics.

Assuntos

Projetos de Pesquisa , Redução de Peso , Humanos , Viés , Estudos Longitudinais , Autorrelato , Ensaios Clínicos Controlados Aleatórios como Assunto

16.

Exploring genetic influences on adverse outcome pathways using heuristic simulation and graph data science.

Romano, Joseph D; Mei, Liang; Senn, Jonathan; Moore, Jason H; Mortensen, Holly M.

Comput Toxicol ; 252023 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-37829618

RESUMO

Adverse outcome pathways provide a powerful tool for understanding the biological signaling cascades that lead to disease outcomes following toxicity. The framework outlines downstream responses known as key events, culminating in a clinically significant adverse outcome as a final result of the toxic exposure. Here we use the AOP framework combined with artificial intelligence methods to gain novel insights into genetic mechanisms that underlie toxicity-mediated adverse health outcomes. Specifically, we focus on liver cancer as a case study with diverse underlying mechanisms that are clinically significant. Our approach uses two complementary AI techniques: Generative modeling via automated machine learning and genetic algorithms, and graph machine learning. We used data from the US Environmental Protection Agency's Adverse Outcome Pathway Database (AOP-DB; aopdb.epa.gov) and the UK Biobank's genetic data repository. We use the AOP-DB to extract disease-specific AOPs and build graph neural networks used in our final analyses. We use the UK Biobank to retrieve real-world genotype and phenotype data, where genotypes are based on single nucleotide polymorphism data extracted from the AOP-DB, and phenotypes are case/control cohorts for the disease of interest (liver cancer) corresponding to those adverse outcome pathways. We also use propensity score matching to appropriately sample based on important covariates (demographics, comorbidities, and social deprivation indices) and to balance the case and control populations in our machine language training/testing datasets. Finally, we describe a novel putative risk factor for LC that depends on genetic variation in both the aryl-hydrocarbon receptor (AHR) and ATP binding cassette subfamily B member 11 (ABCB11) genes.

17.

Aliro: an automated machine learning tool leveraging large language models.

Choi, Hyunjun; Moran, Jay; Matsumoto, Nicholas; Hernandez, Miguel E; Moore, Jason H.

Bioinformatics ; 39(10)2023 10 03.

Artigo em Inglês | MEDLINE | ID: mdl-37796839

RESUMO

MOTIVATION: Biomedical and healthcare domains generate vast amounts of complex data that can be challenging to analyze using machine learning tools, especially for researchers without computer science training. RESULTS: Aliro is an open-source software package designed to automate machine learning analysis through a clean web interface. By infusing the power of large language models, the user can interact with their data by seamlessly retrieving and executing code pulled from the large language model, accelerating automated discovery of new insights from data. Aliro includes a pre-trained machine learning recommendation system that can assist the user to automate the selection of machine learning algorithms and its hyperparameters and provides visualization of the evaluated model and data. AVAILABILITY AND IMPLEMENTATION: Aliro is deployed by running its custom Docker containers. Aliro is available as open-source from GitHub at: https://github.com/EpistasisLab/Aliro.

Assuntos

Algoritmos , Software , Aprendizado de Máquina , Idioma

18.

STAR_outliers: a python package that separates univariate outliers from non-normal distributions.

Gregg, John T; Moore, Jason H.

BioData Min ; 16(1): 25, 2023 Sep 04.

Artigo em Inglês | MEDLINE | ID: mdl-37667378

RESUMO

There are not currently any univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy.Background Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.Results In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model poorly fit the outliership scores, we also compared the isolation forest algorithm with a model that entails removing as many datapoints as STAR_outliers does in order of decreasing outliership scores. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that our STAR_outliers algorithm removes significantly closer to 0.7 percent of values from these features than other outlier removal methods on average.Conclusions STAR_outliers is an easily implemented python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.

19.

Improving Genetic Association Studies with a Novel Methodology that Unveils the Hidden Complexity of All-Cause Heart Failure.

Gregg, John T; Himes, Blanca E; Asselbergs, Folkert W; Moore, Jason H.

medRxiv ; 2023 Aug 04.

Artigo em Inglês | MEDLINE | ID: mdl-37577697

RESUMO

Motivation: Genome-Wide Association Studies (GWAS) commonly assume phenotypic and genetic homogeneity that is not present in complex conditions. We designed Transformative Regression Analysis of Combined Effects (TRACE), a GWAS methodology that better accounts for clinical phenotype heterogeneity and identifies gene-by-environment (GxE) interactions. We demonstrated with UK Biobank (UKB) data that TRACE increased the variance explained in All-Cause Heart Failure (AHF) via the discovery of novel single nucleotide polymorphism (SNP) and SNP-by-environment (i.e. GxE) interaction associations. First, we transformed 312 AHF-related ICD10 codes (including AHF) into continuous low-dimensional features (i.e., latent phenotypes) for a more nuanced disease representation. Then, we ran a standard GWAS on our latent phenotypes to discover main effects and identified GxE interactions with target encoding. Genes near associated SNPs subsequently underwent enrichment analysis to explore potential functional mechanisms underlying associations. Latent phenotypes were regressed against their SNP hits and the estimated latent phenotype values were used to measure the amount of AHF variance explained. Results: Our method identified over 100 main GWAS effects that were consistent with prior studies and hundreds of novel gene-by-smoking interactions, which collectively accounted for approximately 10% of AHF variance. This represents an improvement over traditional GWAS whose results account for a negligible proportion of AHF variance. Enrichment analyses suggested that hundreds of miRNAs mediated the SNP effect on various AHF-related biological pathways. The TRACE framework can be applied to decode the genetics of other complex diseases. Availability: All code is available at https://github.com/EpistasisLab/latent_phenotype_project.

20.

ChatGPT and large language models in academia: opportunities and challenges.

Meyer, Jesse G; Urbanowicz, Ryan J; Martin, Patrick C N; O'Connor, Karen; Li, Ruowang; Peng, Pei-Chen; Bright, Tiffani J; Tatonetti, Nicholas; Won, Kyoung Jae; Gonzalez-Hernandez, Graciela; Moore, Jason H.

BioData Min ; 16(1): 20, 2023 Jul 13.

Artigo em Inglês | MEDLINE | ID: mdl-37443040

RESUMO

The introduction of large language models (LLMs) that allow iterative "chat" in late 2022 is a paradigm shift that enables generation of text often indistinguishable from that written by humans. LLM-based chatbots have immense potential to improve academic work efficiency, but the ethical implications of their fair use and inherent bias must be considered. In this editorial, we discuss this technology from the academic's perspective with regard to its limitations and utility for academic writing, education, and programming. We end with our stance with regard to using LLMs and chatbots in academia, which is summarized as (1) we must find ways to effectively use them, (2) their use does not constitute plagiarism (although they may produce plagiarized text), (3) we must quantify their bias, (4) users must be cautious of their poor accuracy, and (5) the future is bright for their application to research and as an academic tool.

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA